Efficient Minimal Perfect Hash Language Models
Abstract
The recent availability of large collections of text such as the Google 1T 5-gram corpus (Brants and Franz, 2006) and the Gigaword corpus of newswire (Graff, 2003) has made it possible to build language models that incorporate counts of billions of n-grams. This paper proposes two new methods of efficiently storing large language models that allow O(1) random access and use significantly less space than all known approaches. We introduce two novel data structures that take advantage of the distribution of n-grams in corpora and use varying numbers of minimal perfect hashes to compactly store language models. Models containing full frequency counts of billions of n-grams require 2.5 bytes per n-gram, and models of quantized probabilities require 2.26 bytes per n-gram. We show that our approaches are simple to implement and can easily be combined with pruning and quantization to achieve additional reductions in the size of the language model.
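To make the general idea concrete, the following is a minimal sketch of a minimal perfect hash lookup with O(1) access. It is not the paper's data structures, which the abstract does not spell out. It assumes a simple hash-and-displace construction, a 12-bit fingerprint per slot to reject most unseen n-grams, and plain Python lists in place of packed bit arrays. All names (`build_mph`, `mph_index`, `NgramCounts`, `fp_bits`) are hypothetical and chosen only for illustration.

```python
import hashlib


def h(key: str, seed: int) -> int:
    """Seeded 64-bit hash of a string (BLAKE2b, used here for determinism)."""
    data = seed.to_bytes(8, "little") + key.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def fingerprint(key: str, bits: int) -> int:
    """Short fingerprint of the key; the seed constant is arbitrary."""
    return h(key, 0xF1F1) % (1 << bits)


def build_mph(keys):
    """Hash-and-displace: find one seed per bucket so that every key
    lands in its own slot of a table with exactly len(keys) slots."""
    n = len(keys)
    if n == 0:
        raise ValueError("need at least one key")
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(k, 0) % n].append(k)
    seeds = [0] * n
    taken = [False] * n
    # Place the largest buckets first; they are the hardest to fit.
    for b in sorted(range(n), key=lambda i: -len(buckets[i])):
        bucket = buckets[b]
        if not bucket:
            break
        seed = 1
        while True:
            slots = [h(k, seed) % n for k in bucket]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            taken[s] = True
    return seeds


def mph_index(seeds, key: str) -> int:
    """Two hash evaluations, independent of the number of keys."""
    n = len(seeds)
    return h(key, seeds[h(key, 0) % n]) % n


class NgramCounts:
    """Static n-gram -> count table; the n-gram strings are not stored."""

    def __init__(self, counts, fp_bits: int = 12):
        keys = list(counts)
        self.fp_bits = fp_bits
        self.seeds = build_mph(keys)
        self.fps = [0] * len(keys)
        self.values = [0] * len(keys)
        for k, c in counts.items():
            i = mph_index(self.seeds, k)
            self.fps[i] = fingerprint(k, fp_bits)
            self.values[i] = c

    def get(self, ngram: str):
        i = mph_index(self.seeds, ngram)
        # An unseen n-gram still maps to some slot; the fingerprint
        # rejects it except for a small false-positive rate.
        if self.fps[i] != fingerprint(ngram, self.fp_bits):
            return None
        return self.values[i]


counts = {"the cat": 41, "cat sat": 7, "sat on": 12, "on the": 95, "the mat": 3}
model = NgramCounts(counts)
print(model.get("on the"))
```

Because the keys themselves are discarded, the per-entry cost in a sketch like this is the fingerprint plus the stored value plus the hash function's own overhead. Fewer fingerprint bits save space at the price of more false positives on unseen n-grams.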
Similar resources
Minimal Perfect Hash Rank: Compact Storage of Large N-gram Language Models
In this paper we propose a new method of compactly storing n-gram language models called Minimal Perfect Hash Rank (MPHR) that uses significantly less space than all known approaches. It requires O(n) construction time and allows for O(1) random access of probability values or frequency counts associated with n-grams. We make use of minimal perfect hashing to store fingerprints of n-grams in an...
A Family of Perfect Hashing Methods
Minimal perfect hash functions are used for memory efficient storage and fast retrieval of items from static sets. We present an infinite family of efficient and practical algorithms for generating order preserving minimal perfect hash functions. We show that almost all members of the family construct space and time optimal order preserving minimal perfect hash functions, and we identify the on...
A Survey on Efficient Hashing Techniques in Software Configuration Management
This paper presents a survey on efficient hashing techniques in software configuration management scenarios. It introduces the most important hashing techniques, such as open hashing, separate chaining, and minimal perfect hashing. Furthermore, we evaluate those hashing techniques using large data sets. To that end, we compare the hash functions in terms of time to build the data structur...
A Simulated Annealing Algorithm for Generating Minimal Perfect Hash Functions
We developed minimal perfect hash functions for a variety of datasets using the probabilistic process of simulated annealing (SA). The SA solution structure is a tree representing an annealed program (algorithm). This solution structure is similar to the structure used in genetic programming. When executed, the SA program produces multiple hash functions for the given data set. An initial hash ...
Graphs, Hypergraphs and Hashing
Minimal perfect hash functions are used for memory efficient storage and fast retrieval of items from static sets. We present an infinite family of efficient and practical algorithms for generating minimal perfect hash functions which allow an arbitrary order to be specified for the keys. We show that almost all members of the family are space and time optimal, and we identify the one with mini...